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Abstract 

The paper describes an application of the tree classification method Random For- 
est (RF), as used in the analysis of data from the ground-based gamma telescope 
MAGIC. In such telescopes, cosmic 7-rays are observed and have to be discrimi- 
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nated against a dominating background of hadronic cosmic-ray particles. We de- 
scribe the application of RF for this gamma/hadron separation. The RF method 
often shows superior performance in comparison with traditional semi-empirical 
techniques. Critical issues of the method and its implementation are discussed. An 
application of the RF method for estimation of a continuous parameter from related 
variables, rather than discrete classes, is also discussed. 

Key words: discrimination, classification, decision tree 



1 Introduction 

Ground-based gamma-ray astronomy has in recent years shown to be a source 
of spectacular discoveries, constraining the evohition of the universe and con- 
tributing to the understanding of the origin of cosmic rays. Observations are 
based on Imaging Atmospheric Cherenkov Telescopes (lACTs), which take 
advantage of the Cherenkov radiation emanating from the electromagnetic 
showers that develop during the absorption of gamma-rays in the atmosphere. 
The faint Cherenkov light flashes are collected in a large-diameter mirror, and 
recorded in a pixelized camera. 

Several lACT systems are in successful operation today, both in the Northern 
(MAGIC, VERITAS) and Southern (HESS, CANGAROO) hemisphere; all 
but MAGIC are implemented as multi-telescope arrays. Their scientfic goals 
include galactic and extragalactic sources: Supernova remnants. Pulsars, X- 
ray binaries, Microquasars, Active Galactic Nuclei (blazars or radio galaxies), 
Starburst galaxies and potentially also Gamma Ray Bursts. Due to their small 
aperture lACTs can only perform scans over small areeas, and usually con- 
centrate on sources that have been identified at other wavelengths; however, 
the number of known gamma-ray emitters is increasing fast, and they provide 
essential contributions to the understanding of the non-thermal universe. 

Events seen by an lACT have a very short (r^ 2ns) duration, and the shower 
image is recorded as a compact cluster of pixels in the camera of the lACT. 
A principal component analysis permits to express the characteristics of this 
cluster in image parameters, which will present statistically different properties 
for the (interesting) gamma-rays and the (dominating) hadronic background. 
lACTs provide raw data with a signal-to-noise ratio much smaller than 1%, 
even for bright gamma sources. Establishing powerful methods of hadronic 
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background rejection thus is a prerequisite for the effective utihzation of ob- 
servations with the Cherenkov technique. The fact was recognized with the 
advent of the lACT technique, and has been given ample room in the hter- 
ature, both for telescope arrays and single telescopes, e.g. [1-5]. Multivariate 
methods using global test statistics (e.g. likelihood ratios or artificial neural 
networks) are specifically mentioned in [3] and [5]. 

A case study for and comparison of different advanced classification methods 
for a single-dish lACT can be found in [6] . In the same article the main features 
of Cherenkov images measured by gamma-ray telescopes are addressed and 
explained, and the image parameters used in the 7/h separation are defined. 

In this paper, largely derived from chapter 5 of [7], we limit ourselves to the 
implementation, usage, and functionality of the RF method for the single- 
dish system MAGIC [8]. In [7], a more detailed discussion of the RF method 
and comprehensive MC studies are given. The implementation closely follows 
the method desribed by L. Breiman [9]. The application in 7/h separation 
is discussed in detail. Recent MAGIC publications (e.g. [10-12] use the RF 
technique, and [13] dicusses it in the context of the reference observations of 
the Crab nebula. A short comparative study with the established method of 
cuts in scaled image parameters is given. We also discuss an application of the 
RF method in estimating the gamma energy, a continuous variable, in terms of 
the observed image parameters. In the following chapter [2] the Random Forest 
method will be described in detail, since existing mathematical treatments 
show only few practically useful aspects, if any. The reader not interested in 
these details may regard RF as a black-box tree classification method, and 
continue with the results in section 14.21 



2 Basics of the Random Forest (RF) method 

The Random Forest method is based on a collection of decision trees, built 
up with some elements of random choices. Like many other classification and 
regression methods, a Random Forest is constructed on the basis of train- 
ing samples suitable for the application. For the purpose of 7/h separation, 
the training samples contain the two classes of gammas (usually Monte Carlo 
(MC) data) and hadrons (usually OFF data, also ONp~| or MC data are pos- 
sible). In the further discussion, the following definitions will be used: We call 
the elements of the training sample events. Each event is characterized by 
a vector whose components are image parameters obtained by analyzing the 
camera pixels. We use the familiar Hillas parameters [1] and some additional 

^ ON and OFF data are telescope data obtained by pointing at the source or on a 
nearby, sourceless region of the sky, respectively 
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parameters, but also observation- and detector-related parameters, like cos{9), 
9 being the zenith angle of the source. The space spanned by the event vectors 
is multi-dimensional. One can consider the training samples of gammas and 
hadrons as a single labeled training sample, viz. each event has an integer la- 
bel (called hadronness) indicating if the event belongs to the class of gammas 
(hadronness 0) or to the class of hadrons (hadronness 1). 

From this sample, a binary decision tree can be constructed, subdividing the 
parameter space first in two parts depending on one of the parameters, and 
subsequently repeating the process again and again for each part. The best 
choice of parameter and the criteria for subdividing are discussed below. Us- 
ing a single tree for classification purposes, however, usually gives mediocre 
results. The tree is overoptimized on the training sample, and there is only 
poor generalization viz. new events will be classified rather badly. This is 
shown in figure [l} Note, however, that even a set of trees (forest) results in 
some sparsely populated areas, where the hadronness necessarily is not well 
defined, and the probability of misclassification may be substantial. 




Fig. 1. Left: Illustration of the RF method for a simple 2-dimensional model case. 
The black and white points are the observed points in class gamma and hadrons, 
respectively. They are distributed according to two different, but overlapping 2-di- 
mensional Gaussians. The result of separation in terms of hadronness is shown in 
colour. Right: The result of using a single tree on the same data gives no probability 
measure like hadronness, but only y/n answers. Rs performance is inadequate. 



There is no pruning (tree simplification by removing some branches consid- 
ered irrelevant) of the trees in the Random Forest algorithm. Instead, the 
RF creates a set of largely uncorrelated trees, and combines their results to 
form a generalized predictor. Two random elements prior to and within the 
tree growing process serve to approximate ideally uncorrelated trees; they are 
described in the following sections. 
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2.1 Bootstrap aggregating (bagging) 



There is usually a single data sample in each class used for training. A straight- 
forward solution to obtain independent trees is to split the training sample 
into as many non-overlapping subsamples as trees should be grown. However, 
there are usually not enough training data available for this approach. This is 
especially the case if dealing with air shower data, which are always costly to 
generate (w.r.t. computer time and storage space). A different way is to pro- 
duce a bootstrap sample for each tree by sampling n times with replacement 
from the original training sample containing n events. This procedure guaran- 
tees that the events' image parameter distributions are statistically identical 
for all bootstrap samples (and equal to the image parameter distributions of 
the original training sample, since the probability of selecting an event is con- 
stantly 1/n for the sampling with replacement procedure) , while the bootstrap 
samples do not contain the same events. It may (and will) happen that cer- 
tain events are taken more than just once: The probability of not selecting a 
certain event is equal to (1 — l/n), which becomes (1 — 1/n)" when repeating 
the selection process n times. As /im„^oo(l + x/n) = e^, the probability of 
not selecting an event in the bootstrap procedure becomes 1/3. Thus, 

in each bootstrap sample there will be on average (1 — 1/e) original training 
events, the rest (also kept in the sample) are copies. 



2.2 Tree growing and random split selection 

The tree growing begins with the complete sample contained in a single node, 
the so called root node, which is identical to the complete image parameter 
space. In the following the 7/h separation is achieved by splitting (or cutting) 
each node into two successor nodes using one of the image parameters at a 
time, with a cut value optimized to separate the sample into its classes (in 
our case two: gammas and hadrons). This corresponds to a successive divi- 
sion of the image parameter space into hypercubes. In order to measure the 
classification power (separation ability) of an image parameter and to opti- 
mize the cut value, the Gini index is used The Gini index is a frequently used 
measure in dealing with classifiers, originally in economics. Named after the 
Italian economist Corrado Gini, it measures the inequality of two distribu- 
tions, e.g. gamma acceptance and hadron acceptance as function of a cut in 
a variable. It is defined as the ratio between a) the area spanned by the ob- 
served cumulative distribution and the hypothetical cumulative distribution 
for a non-discriminating variable (uniform distribution, 45-degree line), and b) 
the area under this uniform distribution. It is a variable between zero and one; 
a low Gini coefficient indicates more equal distributions, a high Gini coefficient 
shows unequal distribution. 
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The choice of the parameter taken for sphtting is randomized (see below for 
details). The splitting process stops if the node size (events per node) falls 
below a limit specified by the user, or if there are only events of one class 
(only gammas or only hadrons) left in the node, which therefore needs not be 
split further. These terminal nodes can also be called elementary hypercubes, 
they cover the entire image parameter space without intersections or gaps. 
To each terminal node the remaining training events assign a class label / (0 
for gammas, 1 for hadrons). For terminal nodes still containing a mixture of 
events of different classes, a mean value is calculated for /, taking into account 
the class populations Nh of hadrons and A^^ of gammas: / = Nh/{Nh + N^). 
The original program [14] uses a majority vote, and does not calculate mean 
values. 

Before going into more details, the classification process is briefly described: 
One can take a completely grown tree as starting point (see figure [2]). The 



Fig. 2. Sketch of a tree structure for the classification of an event v with components 
viength, Vyjidth, CL^d Vgize- One Can follow the decision path through the tree, leading 
to classification of the event as hadron. 

task is to classify an event characterized by a vector v in the image parameter 
space. V is fed into the decision tree; at the first (highest level) node there is a 
split in a certain image parameter (e.g. 'length'). Depending on the component 
(image parameter) 'length' in f , the event v proceeds to the left node (length 
< split value) or to the right node (length > split value) at the next lower level. 
This node again splits in some other (or by chance the same) component, and 
the process continues. The result is that v follows a track through the tree 
determined by the numerical values of its components, and the tree nodes' 
cut values, until it will end up in a terminal node. This terminal node assigns 
a class label / to f , which can now be denoted as li{v), where i is the tree 
number. 

The vector v will be classified by all trees. Due to the randomization involved, 
different trees will often give different results, hence the name 'Random Forest'. 
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From these results, a mean classification is calculated: 



h{v) 



ritr 



(1) 



This mean classification is called Hadronness, and is used as the only test 
statistic (split-parameter) in the 7/h separation (see figure [s]). 




Fig. 3. Mean hadronness for two test samples of gammas (left peak, black) and 
hadrons (right peak, red). Hadronness is the final and only test statistic in 7//1 
separation. 



The splitting process is somewhat randomized by a feature called random split 
selection. The parameter candidates for a split are chosen randomly from the 
total number of available parameters. Among the candidates, the parameter 
and corresponding cut value to be used for splitting are chosen by the minimal 
Gini index. In the case of two classes, the Gini index Qcini can be referred to 
as binomial variance of the sample scaled to the interval [0, 1]. The Gini index 
(or GINI coefficient) can be expressed in terms of the node class populations 
N^, Nfi and the total node population A^: 



Q 



Gini 



N 



(^binomial 



N N 



N^{N-N^ 

iV2 



G [0, 1] 



(2) 



Qcini of a node is zero for the ideal case that only one class is present in the 
node {N^ = or = 0) . The Gini index of the split is calculated by adding 
the Gini indices of the two successor nodes (denoted by left and right node) 
and scaling the result to [0,1]: 



Q 



Gini 



N^left Nhleft _|_ N^right Nhi •ight 



^right ^right 



e [0, 1] 



(3) 
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Choosing the smallest Qcini corresponds to minimizing the variance of the 
population of gammas and hadrons, and naturally purifies the sample. Min- 
imization of the Gini index provides both the choice of the image parameter 
and the split value to be used. 

More details concerning the Random Forest method can be found in [14] . The 
original program was modified to calculate the mean hadronness instead of 
a or 1 majority vote for a class. Calculating the arithmetic mean by using 
weights (e.g. using the Gini index of terminal nodes) did not further improve 
the results [6], [7]. 



3 Control of the training process 



In this chapter we address some specific aspects of RF related to the training 
process. Proper training depends on several parameters, steering the growing 
of trees, which the user should be aware of. In the following these parameters 
are described. 

• Number of trees: the number of trees must be chosen large enough to ensure 
the convergence of the error cr, given by 



E ^sample f 1 est ( ^ \ t,true'\2 
1=1 VH ['Hree) ~ '^i ) ^^^^ 

\ ^sample 



o'i'ritree) is the rms error of the estimated hadronness. hl^*{ntree) denotes 
the estimated hadronness (which depends on the number ritree of combined 
trees) and /i*™^ is the true hadronness of event i in the sample, which 
contains Usampie events in total. The convergence process is shown in figure [4] 
for the training of RF on an MC gamma and MC hadron sample. 

Care was taken that the test sample, for which the figure was produced, is 
disjunct from the training sample. When taking events already used in the 
training process, a would be underestimated. From figure |4| the following 
practical method can be deduced: One generates a reasonably high num- 
ber of trees (100 trees is usually sufficient), performs the training process, 
and then finds decisions for a test sample using a diminishing number of 
trees, to judge how many trees still give satisfactory results. Trees generated 
during the training process are stored successively in a file. For the classifi- 
cation task one can read in the actually needed number of trees. If no test 
sample is available, one can take (j{ntree) as calculated from the so-called 
out-of-bag data during the training. The out-of-bag data are the 'residue' 
of the bagging procedure, as explained in the following. In the bagging pro- 
cedure (generating of bootstrap samples, see chapter [2]) there are data for 
each tree which have not been used for the tree's bootstrap sample. Being 
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Fig. 4. Error (rms, = the estimated, hadronness as function of the number 

of trees used. Also shown is the variance of each single tree. 



independent, they can be used as test data for the corresponding tree. In 
other words, each event of the original training sample can be used as test 
data for ^ 1/3 of the trees. If one observes a sufficient convergence of a 
calculated from out-of-bag data after, say, 150 trees, actually 50 trees are 
needed. 

Overtraining: During tree growing, the cut values of the parameters are 
adjusted according to the training sample. This overtraining is not a major 
drawback, it affects merely the training sample, which provides these exact 
cut values. According to [14] the overtraining (or overoptimization) vanishes 
in case of an infinite number of trees. The practical method described above 
favours a minimal forest, with a number of trees sufficiently large to ensure 
a classification error (of a test sample), which is not significantly decreased 
by adding more trees. Such a forest still shows overtraining: when applying 
7/h separation to the training data, the classes of gammas and hadrons can 
usually be well separated by a cut in hadronness = 0.5. In other words, 
each tree 'learned by heart' the training events, and the same is true for the 
entire forest. The situation is the same with classical cuts: the cut values 
are optimized on a certain observed data set from a gamma source or on 
Monte Carlo data, and later on applied to the data to be analyzed, which 
must not contain the training data. 

Number of trials in random split selection: This concerns the parameters 
considered for splitting. A good empirical value for their number is 
where is the total number of parameters used in tree growing [14]. 
Node size: this is the minimum size of node at which further splitting stops. 
For correctly labeled training events nodesize = 1 can be used, for partly 
incorrect labeled data (e.g. using ON-data as hadrons) nodesize > 1 is 
preferable, since data are not intended to be split completely. Experience 
tells that a small number < 10 is best. 
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4 Application of RF in 7/h separation 



4-1 Remarks concerning the training process 

In this chapter some features related to the Random Forest method will be 
briefly addressed. Some of these remarks are valid also for many other ad- 
vanced classification methods in need of a training process, like Neural Net- 
works or linear discriminant analysis. 

• Training data for Cherenkov telescopes: We have used OFF data and MC 
gammas (correctly labeled samples) or ON data and MC gammas (partly 
wrongly labeled hadron sample). It is usually advisable not to use MC 
hadrons, since hadronic showers are diflicult to simulate (unlike gamma 
showers which have a pure electromagnetic nature), so that MC hadrons 
are diflicult to match in all details with real data. In fact, there is no need 
to use MC hadrons, when OFF or ON data are available. Choosing ON data 
for training has the advantage of obviating OFF data taking, and of using 
data taken under identical observational conditions. The Random Forest al- 
gorithm is stable enough to deal with a hadron sample containing up to 1% 
of gammas, as shown in flgure [5} where the training was performed using 
OFF data with variable artiflcial contamination for the hadrons, and MC 
data for the gamma sample. In order to simulate ON data, the OFF data 

I random forest, OFF dam conlaminaied by MC-gammas | 
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Fig. 5. Neyman-Pearson or ROC diagrams of hadron training samples with a con- 
tamination of (mislabeled) gamma events. A hadron sample with 1% gammas intro- 
duces a negligible loss in selection efficiency. 

were contaminated with MC gammas, i.e. the degree of contamination was 
known. For all simulated gamma admixtures the reduction of the separation 
efliciency beomes visible only in a region of low gamma acceptances, which 
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is usually not advisible to operate in (too low gamma efficiency). Depending 
on the set of image parameters used for training, a generalization of this 
result may not be possible. 
• Types of parameters: All parameters are treated in the same way, which 
means that in particular detector-related or observational parameters like 
cos{9) {9 being the zenith angle), a (image noise, averaged over all pixels), 
or size (integrated signal of the image), must be used with care. The sense 
of using such parameters is that cuts in other image parameters will depend 
on them, but not that they should be used for cuts. Thus, in general, one 
can distinguish between parameters to be used for cuts, and parameters 
on which the cuts in other parameters may depend. To circumvent the 
problem, the training data must be chosen not to permit a classification 
using these parameters alone (e.g. by using the same (flat) distribution of 
cos{9) in both training samples). Splits in these parameters, in training 
samples prepared this way, can not directly serve for separating gammas 
and hadrons. Additional attention must also be payed if e.g. the gamma 
data have discrete cos{9) values for technical reasons in the Monte Carlo 
production. In this case the cos{9) values appearing in the hadron sample 
must be rounded to the same values (binned), or the Monte Carlo data 
artificially spread to become continuous. 



4.-2 Comparison with direct cuts in image parameters 

An extensive comparison of methods applied to Monte Carlo data sets for 
training and test samples was given in [6]. One of the methods described 
there (called Direct Selection) was based on using simple AND/OR cuts in 
the multi-dimensional space of image parameters. The choice of parameters or 
functions thereof offers many possibilities for tuning. We repeat here a similar 
comparison, again using Monte Carlo data, using scaled image parameters. 
Like in [6], no claim can be made that this result, found in favor of the RF 
method, can be generalized to all parameter choices or to real data. Exhaustive 
comparisons with real data are lengthy, due to the high dimensionality of 
the problem, which includes data selection and image cleaning steps even 
before image parameters are obtained. Quality comparisons using real data are 
also influenced by the unavoidable changes in operation conditions, that are 
reflected in data corrections whose effect on separation methods are difficult 
to evaluate. A comparative study with comprehensive MAGIC data samples 
is, however, in preparation. 

For this comparison we used independent training and test samples, of 15000 
events each. Hadrons were simulated with the parameters: energy range 200Gel^ < 
E < 30TeV; spectral index a = —2.7; zenith angle range < 9 < 30°; im- 
pact parameter range < it! < 400m; viewing cone 5°. The gamma sim- 



12 



ulation settings were: energy range 50GeV < E < 30TeV; spectral index 
Crab-like a = —2.6; zenith angle range < ^ < 30°; impact parameter range 
< i? < 200m; Figure |6] shows the corresponding distributions of the image 
parameters width [deg] and length [deg] as functions of size [phe], for gammas 
and hadrons. All data were pre-cut to obtain high-quality training and test 
samples, requiring leakage^ < 0.1, dist> 0.3°, size > 200phe. Clearly, width 



MC gammas 



MC hadrons 




4.5 

log,„(size[phe]) 



4.5 

log„(size[phe]) 



Fig. 6. Distribution of the Hillas parameters width (top) and length (bottom) as 
function of log(size), for gammas (left) and hadrons (right), as used in the training 
samples. The profiles are shown in red (gammas) and black (hadrons), showing 
that both parameters are good separators for size values above 200 photoelectrons 
(corresponding to about 100 GeV) 

and length are good separation parameters, at least for values of size exceed- 
ing 200 phe (photo electrons), which corresponds approximately to energies 
above lOOGeV^. The size dependence of width and length can be dealt with by 
using scaled parameters: The size range (of MC gamma data) is divided into 
bins, and for each bin i mean and variance of the width distribution {wi and 
(T^J are calculated. The scaled width Wi^scaied for each bin is then obtained by 

Wi^scaled = {Wi - Wi)/^^,^. 

The same procedure is used for the length parameter. As a result one obtains a 
normalized width and length distribution for gammas: they follow a pdf (prob- 
ability density function) with mean and variance 1. In these variables, static 
(size-independent) cuts are used for 7/h separation. In order to find optimal 
cuts, a maximization of the Q-value which relates the relative acceptances of 
gamma-rays and hadrons {Q = e^/ y^) was performed, using the Metropolis 
minimization packag^^ followed by a SIMPLEX minimization. Both packages 



^ this parameter, not defined in [6], uses an estimate of fractional energy escaping 
the camera 

^ which includes random perturbations in the search, thus avoiding to return local 
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are part of TMinuit in the root analysis environment [15]. 

Both the Random Forest and the scaled parameter method used independent 
data for training and testing. Only the parameters size, dist, width, and length 
were used. The results are compared in the Neyman-Pearson or ROC (Receiver 
operator characteristic) diagrams of figure [7j these diagrams show gamma 
acceptance as function of hadron acceptance. In order to obtain for the scaled 




Fig. 7. ROC curves for ^/h separation in the test sample, by the RF method (higher 
curve) and by cuts in scaled parameters, using the same parameters. 



parameter method more than a single point (that of overall maximum Q) in 
the ROC diagram, a regularizer a{eh — pY was introduced (a generalization 
of the method used in [6]). Here p denotes a target acceptance for hadrons, 
and eh is the freely variable hadron acceptance, which is obtained from the 
maximization of Q and different for each p. We used a high scaling number 
a = 1000 to ensure that the optimization will give as a result a set of cuts 
with eh close to p. 

These results are shown as the lower curve in figure [7| We should stress again 
that this comparison can in no way show a general superiority of the RF 
method; practical experience shows that for a given data sample other methods 
(also including direct selection as in the above example) can, at an effort, be 
fine-tuned to give results comparable to the RF method. However, in no case 
has the RF result been shown inferior, and much less tuning is needed (and 
possible) with the RF method. More comparisons (including also MAGIC 
data) can be found in [16]. 



minima 
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5 Using a Random Forest estimator for a continuous variable 



The RF method permits also to construct an algorithm of estimating a contin- 
uous quantity rather than a discrete class membership, dealt with in previous 
sections. We have used this method to estimate non-analytically the parti- 
cle energy from the measured image parameters. Two main approaches are 
possible: 

• Forced division into classes: Class labels are assigned to the training events 
according to an energy grid. As a result, multiple classes Eq,Ei, En-i are 
created. In the RF training process the related class populations are taken 
into account together with a more general Gini index [9] 

Pi = Ni/N (5) 

n-l 

QGini^'^-^p'l (6) 
i=0 

Here i is the class index (0 < i < n — 1). As already shown above, the 
Gini index of a split is evaluated as sum of the two Gini indices obtained 
after the split, and minimized. After the training procedure, the class pop- 
ulations inside a terminal node are used to calculate the estimated energy 
corresponding to the terminal node: 

In this application of RF each tree returns an estimated energy and the 
overall mean is calculated as the final estimated energy. 

• A splitting rule based on the continuous quantity: It is possible to completely 
avoid the use of classes by introducing a splitting rule, which does not rely 
on class populations. The idea of the Gini index (with its interpretation as 
binomial variance of the classes) as split rule is a purification of the class 
populations, i.e. a separation of the classes, in the subsamples after the split 
process. Similarly, when using the variance in energy as a splitting criterion, 
the subsamples are purified with respect to their energy distribution. 



1 ^.o 1 



i=i 



n-l 



N 

] (8) 

\i=i 



In analogy to the Gini index of the spht, the variance of the split is calculated 
by adding the subsample energy variances, taking into account the node 
populations as weights: 

AE) ^ j^^^^ (N,al{E) + NnaliE)) (9) 
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We have used both approaches for a set of Monte Carlo data. With 100 classes 
for the first (classification) method, it produces results nearly identical to those 
of the second (regression) approach. The results of this latter RF approxima- 
tion for energy can be seen from figure |8j The linearity is perfect, and the en- 
ergy resolution (as defined by the rms error cje/ E) comes out 26% at 100 GeV 
and 19% at 1 TeV, very fair values for a single telescope (telescope arrays can 
reach better resolution). We have not found an analytical parameterization for 
energy expressed in terms of image parameters giving a result better than with 
the RF representation; with extensive tuning, results comparable in quality 
have been found, though. 




3.5 4 1.5 2 2.5 3 3.5 4 

log,„(EJGeV]) log„(EJGeV]) 



Fig. 8. Left: The relation between the RF-estimated energy (horizontal) and initial 
Monte Carlo energy (vertical axis) is perfectly linear. Right: The rms error ge/E 
as function of initial energy. 



6 Conclusions 



The Random Forest (RF) method based on multiple decision trees was exten- 
sively tested as an analysis tool in the 7/h separation for data obtained with 
the MAGIC telescope. In this paper we discuss many implementation details 
and the parameters a user has to become familiar with. We also compare the 
performance of RF with the more conventional technique of cuts in scaled im- 
age parameters, using MC data. It could be shown that RF in this comparison 
is superior to the classical method. This comparison does not imply a general 
superiority of the RF method; practical experience shows that for a given data 
sample the conventional methods (like dynamical cuts or cuts in scaled image 
parameters) may be tuned to give results comparable (but not superior) to 
the RF method. A dedicated comparative study using MAGIC experimental 
data is still under way. 

The RF method does produce stable results and is robust with respect to 
input parameters, even if strongly correlated. The method adjusts itself to the 
available multi-dimensional space, with a minimum of human intervention: 
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there are only few tunable parameters, which can be chosen according to 
simple criteria (number of trees, trials in random split selection and final 
node size). This simpler control and tuning can then be seen as a general 
advantage over conventional methods. Proper training samples, however, are 
important, as in any advanced method requiring a training process, i.e. one has 
to rely on a good Monte Carlo simulation. Using OFF or ON data as hadron 
sample limits the MC dependence to the gamma showers, better understood 
than hadron showers. There remains, however, the need to correctly treat 
atmospheric conditions under different zenith angles, and good knowledge of 
the detector. 

Training and classification are fast: benchmarks using a 1.5 GHz PC (Athlon 
XP), with training and test samples each containing 10.000 events, a total 
of 10 image parameters used, 100 trees used for classification, each tree com- 
pletely grown (nodesize=l), 3 trials in random split selection, give one minute 
for training and 2 ms/event for classification. A comparable analysis technique 
like Neural Networks demands substantially more computer time for training. 
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